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Abstract 

The presence of different transcripts of a gene across samples can be analysed by whole-transcriptome microarrays. 
Reproducing results from published microarray data represents a challenge owing to the vast amounts of data and 
the large variety of preprocessing and filtering steps used before the actual analysis is carried out. To guarantee a 
firm basis for methodological development where results with new methods are compared with previous results, 
it is crucial to ensure that all analyses are completely reproducible for other researchers. We here give a detailed 
workflow on how to perform reproducible analysis of the GeneChip®Human Exon 1.0 ST Array at probe and probe- 
set level solely in R/Bioconductor, choosing packages based on their simplicity of use. To exemplify the use of the 
proposed workflow, we analyse differential splicing and differential gene expression in a publicly available dataset 
using various statistical methods. We believe this study will provide other researchers with an easy way of accessing 
gene expression data at different annotation levels and with the sufficient details needed for developing their own 
tools for reproducible analysis of the GeneChip® Human Exon 1.0 ST Array. 
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INTRODUCTION 

In the field of microarrays, it has traditionally been 
difficult to compare new methods with already 
established and published methods, as different 



strategies for preprocessing, summarizing and filter- 
ing make it almost impossible to work with the exact 
same data, even when the raw data is made available. 
That is why we consider reproducible research 
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of fundamental importance as it will facilitate easy 
(i) revision of articles, (ii) access to data and results, 

(iii) communication with other researchers and 

(iv) comparison between different methods. 
Reproducible research is gaining relevance among 
the scientific community as shown by the number 
of articles published on the subject during the past 
years [1-4]. loannidis etal. showed that the results of 
only 2 of 18 published microarray gene-expression 
analyses were completely reproducible [5]. This is 
why some authors demand that documentation 
and annotation, database accessions and URL links 
and even scripts with instructions are made publicly 
available [2]. Journals like Biostatistics have even 
appointed an Associate Editor for reproducible 
research, but still treat it as a 'desirable goal' rather 
than a requirement [6]. Setting up a framework for 
reproducible research necessarily implies working 
with free and open-source software, for example, 
R/Bioconductor [7, 8]. Additionally, using S weave 
[9, 10] (a tool for embedding R code in 
LATEXdocuments [11]) enables automatic reports 
that can be updated with output from the analysis. 

The main tool in this article will be the 
Bioconductor package aroma. affymetrix [12] that 
can analyse all Affymetrix microarray types with a 
Chip Definition File (CDF file). Affymetrix some- 
times refer to their micro arrays as 'chips'. The 
number of arrays (samples) that can be simulta- 
neously analysed by aroma. affymetrix is virtually 
unlimited, as the system requirements are just 1 GB 
PJVM, for any operating system [13]. This package is 
freely available and can easily be installed into R. The 
aroma. affymetrix website www.aroma-project.org is 
conceived as reference for all the possible micro arrays 
that can be analysed with aroma. affymetrix, and 
does not focus specifically on the analysis of the 
GeneChip® Human Exon 1.0 ST Array (or exon 
array in short). Portable scripts for a fast and basic 
analysis can be obtained on request to aroma. affy 
metrix's authors. 

The analysis of exon array data in R/ 
Bioconductor is not yet standard. There are several 
packages available, and it can be a tremendous effort 
for a newcomer to maneuver between them and to 
overcome the numerous challenges associated with 
these packages. This article aims to make this task 
easier and to provide a quick reference guide to 
aroma. affymetrix's documentation. We also explain 
how to extract data for different statistical analyses 
and propose a method for gene annotation and for 



gene profile visualization. For this last step, we use 
the packages biomaRt and GenomeGraphs to annotate 
and visualize the transcripts in a genomic context. 
When using the workflow, please remember to 
cite packages aroma. affymetrix, GenomeGraphs and 
biomaRt. 

In this article, we sketch the proposed workflow, 
which is carried out solely in R/Bioconductor, and 
explain some key sections. The complete code is 
available as a S weave (.Snw) [9] document that will 
allow the reader to reproduce our exact results. The 
.Snw document can also be converted into an R 
script and executed. The workflow starts by reading 
in the data, followed by background correction and 
quantile normalization. We then explain how to 
obtain transcript cluster-, probeset- and probe-level 
estimates. Afterwards, different methods for the sta- 
tistical analysis of differential splicing or differential 
gene expression are reviewed. Finally, we make a 
suggestion on how to annotate transcript clusters to 
genes in the lists obtained from the statistical analyses, 
and how to plot the data including genomic infor- 
mation. To exemplify the use of the workflow, an 
example dataset [14] is analysed along the way. 

BACKGROUND ON 
ALTERNATIVE SPLICING 

Splicing is the post-transcriptional process that gen- 
erates mature eukaryotic messenger RNAs 
(mRNAs) from pre-mRNAs by removing the 
non-coding intronic regions and joining together 
the exonic coding regions. For many genes, two or 
more splicing events take place during maturation of 
mRNA molecules, resulting in a corresponding 
number of alternatively spliced mRNAs. These 
mature mRNAs translate into protein isoforms dif- 
fering in their amino acid sequence and ultimately in 
their biochemical and biological properties [15, 16]. 
Alternative splicing is one of the main tools for gen- 
erating RNA diversity, contributing to the diverse 
repertoire of transcripts and proteins [16, 17]. It is 
known that 92—94% of multi-exon human genes are 
alternatively spliced and that 85% of those have a 
minor isoform frequency of at least 15% [18, 19]. 
In our case, we will focus on the detection of differ- 
ential splicing between groups, as for instance tissue 
types, or healthy versus diseased samples. 

The exon array was presented in October 2005 
as a tool for the analysis and profiling of whole- 
transcriptome expression [20—22]. To interrogate 
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each potential exon with at least one probeset, the 
exon array contains about 5.6 million probes 
grouped into >1.4 million probesets (most probesets 
consisting of four probes), which are further grouped 
into 1.1 million exon clusters, or collections of over- 
lapping exons. Finally, exon clusters are grouped into 
>300 000 transcript clusters to describe their rela- 
tionship, for example, shared splice sites or overlap- 
ping exonic sequences. Each gene is covered, on 
average, by 40 probes interrogating regions located 
along the entire gene [23]. This probe positioning 
aims at providing better estimates of gene expression 
levels than previous arrays, and allows for the study 
of differential splicing [24] and differential gene 
expression based on summarized exon expression. 

The exon array has three levels of annotation for 
the interrogated transcript clusters: core, extended and 
full [25]. Core transcript clusters are supported by 
the most reliable evidence such as RefSeq transcripts 
and full-length mRNAs [26], and a core transcript 
cluster is roughly a gene [27]; the extended level 
contains the core transcript clusters plus complemen- 
tary DNA (cDNA)-based annotations [28], and the 
full level contains the two previous levels plus 
ah initio, or algorithmic, gene predictions [29]. It is 
worth noting that aroma. affymetrix enables the 
analysis at the three levels of annotation mentioned 
above, and also that it provides intensity estimates for 
probes, probesets and transcript clusters, allowing for 
a variety of options for the analysis. 



WORKFLOW 

Our workflow for the analysis of exon array data 
starts by setting up the required folder structure for 
aroma. affymetrix. The data are then preprocessed 
and summarized at transcript cluster and/ or probeset 
level. Next, transcript clusters are analysed with sev- 
eral statistical models to detect differential expression 
or splicing, and the transcripts of interest are anno- 
tated and visualized at the end (Figure 1). In the 
code, places where user input is needed are marked 
by and places where the user can choose 

whether to modify parameters are marked by 

To exemplify the use of the tutorial, we have used 
Affymetrix's colon cancer dataset [14], consisting of a 
collection of paired samples of colon tumour tissue 
and adjacent normal tissue from 10 patients and 
available at http : //www . affymetrix . com/ support/ 
techiiical/sample_data/exoii_array_data. af f X. 
According to Affymetrix's website, the RNA 



samples are from a commercial source. This dataset 
has been used in a number of articles to evaluate the 
performance of different analysis methods [30—33], 
and a number of genes have been validated to be 
differentially spliced or not [14]. The analysis was 
performed in R version 2.15.1 (32 bit). 

Start by installing and loading aroma. affymetrix 
in R and loading the other libraries required: 

> source ' http :/ /aroma-pro j ect . org/hbLite . R' ' ) 

> hblnstall ( ' ' aroma . affymetrix' ' ) 

> require (aroma . affymetrix) 

> require (biomaRt) 

> require (GenomeGraphs ) 

Setting up the structure and files for the 
analysis workflow 

This section corresponds to steps 1 and 2 in Figure 1. 
The first step is to create the folder structure: 
under a main folder of our choice — 'myworking 
Directory' — we wiU create the 'rawData' and 
'annotationData' folders, which will be common to 
all aroma. affymetrix projects. Inside 'annotation 
Data', the subfolder 'chipTypes' will contain one sub- 
folder per chip type, with the exact name of the .CDF 
file provided by Affymetrix, 'HuEx-l_0-st-v2' in our 
case. Inside this folder, we will save any library and 
annotation files that might be needed. Besides, the 
'myDataSet' folder will be created under 'rawData' 
to store .CEL files. These files are the output of a 
microarray experiment and contain the result of the 
intensity calculations per probe or pixel. Note that the 
microarray experiment produces one .CEL file per 
array and that one array analyses one sample. Note 
also that we need one 'myDataSet' folder per experi- 
ment and that myDataSet' will be added as a tag at the 
end of the aroma, affymetrix output. 

> user-defined working directory 

> wd <- ' ^ myWorkingDirectory' ' 

> user-defined data set name 

> ds <- ' 'myDataSet ' ' 

In the second step, we save our library (.CDF in 
our case) and .CEL files in the corresponding folders. 
Affymetrix's unsupported CDF files can be down- 
loaded from http://www.affymetrix.com/Auth/ 
support/downloads/library_files/HuEx-l_0-st-v2. 
cdf zip; note that registration is needed. For the exon 
array, Elizabeth Purdom has created a number of 
binary .CDF files based on Affymetrix's text CDF 
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I. Folder structure set up 6. Annotation and visualization 5. Statistical analysis 




2. Library, annotation and .CEL files 3. Data preprocessing and summarization 4. Extracting intensities 

i I I 

Figure I: Flowchart for an analysis with aroma. affymetrix, read counterclockwise starting in upper left corner: 
I. and 2.; folder structure set-up including library, annotation and .CEL files; 3. data preprocessing and summariza- 
tion; 4. extraction of intensities at transcript cluster, probeset and probe level, including filtering recommended by 
Affymetrix; 5. statistical analysis of differentially expressed or spliced transcript clusters and 6. annotation and visua- 
lization of transcript cluster profiles. Blue boxes (step 3.) represent parts of the analysis implemented in aroma. af- 
fymetrix, yellow and green boxes are part of the code provided in this article and purple boxes represent user- 
input needed (steps I., 2. and analysis of differential splicing with FIRMA). Output produced at several steps is 
saved in user-chosen 'output .folder' and represented by a star shape in the workflow. A number of folders are 
automatically generated by aroma. affymetrix (represented by a faded yellow rectangle in 3.); our worl<flow does 
not make use of the contents of such folders. 



file [13] that are faster to query and more memory 
efficient. Such binary .CDF files for core, extended 
and full sets of probesets can be downloaded from 
http://aroma-project.org/node/122. In the exam- 
ple below, we use the custom aroma file for core tran- 
script clusters, which might be updated in the future. 
Our original .CEL files will be copied from the user- 
specified 'myCELf ileDirectory' into the exon 
'rawData' subfolder (the code is part of the .Snw ver- 
sion of this article). The desired output folder specified 
in 'output .folder' should exist in advance. 

> download user-defined library file 

> library . file <- 

+ paste (annotation . data . exon, 
+ ' 'HuEx-l_0-st-v2 ,coreR3 , 



+ A20071112 ,EP.cdf' ' , sep = ' W ' ) 

> download. address <- 

+ ' 'http://bcgc.lbl.gov/cdfFiles/'' 

> file <- paste ( ' 'HuEx-l_0-st-v2 , A2 007 1112 , EP' ' , 
+ ' 'HuEx-l_0-st-v2 , coreR3 , 

+ A20071112 ,EP.cdf' ' , sep = ' W ' ) 

> custom. cdf <- 

+ paste (download. address , file, sep = '' 

> download . file (url = custom. cdf , 
+ dest file = library . file, 

+ mode = ' 'wb' ' , quiet = FALSE) 

> user-defined directory containing .CEL 
+ files 

> eel . directory <- ^^myCELfileDirectory'' 

> user-defined output folder 

> output . folder <- ^ ^ output . folder ' ' 
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In addition, the sample information should be 
saved in a tab-separated file with column names 
celFile, replicate and treatment containing 
.CEL file name (without .CEL), replicate identifier 
and treatment name, respectively. This file should be 
called 'SampleInformation.txt', and it will be copied 
from the user- specified directory into the '\rawData\ 
myDataSet \HuEx-l_0-st-v2 ' folder (which will also 
contain the .CEL files) after it has been created. The 
sample information file for the colon cancer example 
is attached as an additional file. 

> sample . info <- 

+ read, table (file = paste (raw. data . exon, 

+ Samplein formation . txt , sep = 

+ sep = ' ' \ t ' ' , header = TRUE) 

Finally, NetAffx transcript clusters' and probesets' 
annotation files should be saved in 'annotationData/ 
chipTypes/HuEx-l_0-st-v2'. We have used release 
32, which was most up to date at the time of writing, 
and we downloaded files 'HuEx-l_0-st-v2.na32. 
hgl9.transcript.csv.zip' and 'HuEx-l_0-st-v2.na32. 
hgl9.probeset.csv.zip' from http: //www.affymetrix 
. com/estore/browse/products . j sp?prociuctId=131 
452&categoryId=35676&productName=GeneChip-Hu 
man-Exon-ST-Array#l_3, Technical Documentation tab, 
under NetAffx Annotation Files. The extracted .csv 
files should be converted into .Rdata files for query- 
ing them faster in the future. Note that the number 
of lines to skip might differ for future annotation 
files. 



> transcript . clusters .NetAffx. 32 <- 

+ read. csv(file = paste (annotation .data. 

+ exon,HuEx-l_0-st-v2 .na32 .hgl9 . 

+ transcript . CSV sep = 'W''}, skip=24) 

> probesets .NetAffx. 32 <- 

+ read, csv (file = paste (annotation .data, exon, 

+ ' 'HuEx-l_0-st-v2.na32.hgl9. 

+ probeset . CSV ' ' , sep = 'W''}, skip=23) 



Data preprocessing and summarization 
to probeset/transcript cluster level 

After defining chip type and dataset, background 
correction and quantile normalization are carried 
out as shown in Figure 1, step 3. In these preproces- 
sing steps, it is possible to use either Affymetrix's 
original .CDF file, the .CDF files provided by the 
aroma project (the file for core transcript clusters in 
our example) or a .CDF file created by the user. The 



summarization step, however, must be done using 
one of the custom .CDF files available at the 
aroma. affymetrix project website. 

Background correction as defined by Irizarry [34] 
and quantile normalization are performed by 
the RmaBackgroundCorrectionO and Quantile 
Normalization 0 functions, respectively. The raw, 
background corrected and quantile normalized 
probe intensities can be visualized using the 
plotDensityO function applied to the correspond- 
ing object. Summarization is done with the 
ExonRmaPlm function [35]. The parameter 
mergeGroups determines whether to summarize at 
transcript level (TRUE) or probeset level (FALSE). All 
the functions described automatically create sub- 
folders such as 'plmData' or 'probeData' inside 
'myWorkingDirectory'. A more detailed version of 
this code with interesting comments about the 
choice of .CDF and possibilities for quality control 
is available at http : / / www . aroma-pro j ect . org/ 
vignettes/FIRMA-HumanExonArray Analysis. 

> chipType <- ' ' HuEx-l_0-st-v2 ' ' 

> user-defined . CDF file : change tags 
+ parameter 

> cdf < - AffymetrixCdfFile $ byChipType , tags= 
+ (chipType, ' ' coreR3 , A2007 1112 , EP ' ' ) 

> cs < - AffymetrixCelSet$byName (ds, cdf = cdf) 
> 

> # background correction 

> be <- RmaBackgroundCorrection (cs) 

> csBC <- process (be, verbose = verbose) 
> 

> ^ quantile normalization 

> qn <- QuantileNormalization (csBC , 

+ typesToUpdate = ' 'pm' ' ) 

> csN <- process (qn, verbose = verbose) 
> 

> # summarization 

> # transcript cluster level 

> plmTr <- ExonRmaPlm (csN, mergeGroups = TRUE, 



+ tag = ' ' coreProbesets 

+ GeneExpression ' ' ) 

> # probeset /exon level 

> plmEx <- ExonRmaPlm (csN, mergeGroups = FALSE, 
+ tag= ' ' coreProbesetsExon 

+ Expression ' ' ) 



Extraction of intensity estimates 

The aroma. affymetrix documentation focuses on 
analyses at the probeset and transcript cluster levels. 
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The respective intensities are obtained by applying 
the function getChipEf f ectSet () to the transcript 
or probeset plm objects (plmTr and plmEx, respec- 
tively) and then extracting the corresponding 
dataframes. However, it is also possible to extract 
the background-corrected and quantile-normalized 
intensities of all probes using the function 
getUnit Intensities. While plmTr is suitable for 
the FIRMA analysis, plmEx is well suited for 
probeset-level analysis. For the ANOSVA probe 
analysis described in the statistics section, we have 
created one list of dataframes containing probe 
intensities per transcript cluster, and another list of 
dataframes containing probeset intensities per cluster 
(code included in .Snw). 

> # extract a matrix of gene intensities 

> cesTr <- getChipEffectSet (plmTr) 

> trFit <- extractDataFrame (cesTr , 
+ addNames=TRUE) 

> # extract a matrix of probeset intensities 

> cesEx <- getChipEffectSet (plmEx) 

> exFit <- extractDataFrame (cesEx, 
+ addNames=TRUE) 

> # extract a list of probe intensities per gene 

> unitlntensities <- 

readUni ts ( csN, verbose=verbose) 

The high number of transcript clusters analysed 
in combination with the usually small number of 
chips tends to cause a high number of false positives 
[25]. To reduce the number of false positives, 
Affymetrix recommends to perform detection 
above background (DABG) [36] on the dataset 
before the analysis [25]. The DABG procedure is 
not implemented in aroma, affymetrix, so we 
decided to follow the procedure described in [30] 
and use 3 as a threshold for the probeset intensity, 
so that probesets with a log2 intensity below 3 will 
be marked as absent. Except for this change, we 
followed the guidelines proposed in [25] to remove 
absent transcript clusters and probesets, where 
neither probesets that are absent in more than 
half of the samples of a group nor transcript clusters 
with more than half of the probesets absent 
are analysed. 

Besides this filtering based on expression levels, 
another filtering step that removes probesets present- 
ing cross-hybridization is advisable [25]. Cross- 
hybridizing probesets are identified in file 
'HuEx- l_0-st-v2 .na32 .hgl 9 .probeset. csv' and 



removed. Affymetrix recommends to filter them 
out after the analysis, but we have decided not 
to include them in the analysis to narrow down 
the number of probesets/transcript clusters to 
investigate. 

The filtering procedure is part of the .Snw file. In 
our example, where we analysed only core probesets, 
136 233 probesets of 284 258 were deemed present 
by our filter, and the number of transcript clusters to 
analyse (present in both samples) was reduced from 
18 708 to 8401. 

Statistical analysis 

In this section, we give an overview of model-based 
statistical methods available for the analysis of differ- 
ential splicing and suggest a method for the analysis 
of differential gene expression, and the analyses are 
done gene wise. 

Differential splicing 

The models used in this article are extensions of the 
linear model by Li and Wong [37] 

Ypt = + (3, + Spt, (1) 

where y^^ is the intensity measure of probe p for 
treatment t, dp is a probe affinity term, (3^ is the 
gene-level estimate for treatment t and Spt is the 
error term. 

ANOSVA (Analysis of Splicing Variation) was 
presented by Cline et al. [38] and is a two-way 
ANOVA model with probeset and treatment as 
factors: 

Ypet = |i + OCe + (3, + y,, + Spet. (2) 

where y^^^ is the intensity measure of probe p in 
probeset e and treatment t, the overall mean |i is 
the baseline level of all probes in all experiments 
and OCg and (3^ are the probeset and treatment 
effects. The interaction term y^r indicates whether 
the effect of the probeset depends on the treatment 
and is therefore key to the detection of differential 
splicing. 

The model in (2) can be extended to include 
random effects associated to replications from the 
same individual, r: 

ypetr = |i + OCe + + y,, + + Q, + Spetr (3) 

where 1^ is the random effect of each individual r 
and Cfy is the random chip effect. The error terms 
Epetr are independent, identically distributed (i.i.d.) 
N(0,a^)-distributed. 
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Under the null hypothesis of no differential spli- 
cing, the y^^s will all be zero, and therefore we con- 
sider the test statistic 



where y^^ is the estimate for y^^, and 6" is the standard 
error of y^^. Large values will be critical for the null 
hypothesis. Under the model assumptions, t will 
follow a tp^-T-(ne-\-R-i) distribution [39, Chap. 5], 
with N the total number of observations per tran- 
script cluster, T the number of treatments, the 
number of probesets in the transcript cluster and R 
the number of individuals. For each gene, the smal- 
lest P-value from the above ^-tests is regarded as a 
measure of confidence that the gene is differentially 
spliced across the experimental conditions [38]. The 
interaction estimates and variances and thus the test 
statistics are contrast dependent, so choosing a differ- 
ent contrast will alter the gene lists. In our analysis, 
we have used the sum contrast available in R, where 
parameter estimates are centred around zero. 

We use Im to fit the model in equation (3) to 
the probe-level estimates obtained using unit 
Intensities 0 from aroma, affymetrix to 8075 
multi-exonic transcript clusters. Here, we only 
show the code corresponding to equation (3), the 
rest of the code is part of the .Snw file. 



> # '^'^ user-defined parameters for linear model 

> Im <- 

+ Im (intensity ^ probeset + treatment + 

+ C (probeset : treatment , contr . sum) 

+ replicate /treatment 

> n. probesets <- 

+ length (unique (dataframe$probeset) ) 

> main . effects <- 

+ 1 + (n .probesets - 1) + 

+ (length (unique (data frame 

+ $treatment) ) - 1) 

> DS .parameters <- (n .probesets - 1) * 
+ (length (unique (data frame 
+ $ treatment) ) - 1) 

> p.t <- 

+ min ( summary ( Im) $ 

+ coefficients [ (main . effects+1 ) : 

+ (main . effects + DS .parameters) , 

+ ^^Pr(>\t\) '']) 



Although the vast majority of probesets contain 
four probes, transcript clusters containing probesets 
with less than four probes will give rise to an 



unbalanced design. Nevertheless, the ^-distribution 
is almost a normal distribution for long transcript clus- 
ters so the unbalanced design does not have any prac- 
tical implications. For shorter transcript clusters, 
however, an unbalanced design might be a problem. 

The top 10 most differentially spliced genes, sorted 
by the minimum P-value of their ^-tests, appear in 
Table 1. The gene Z^Kwas validated as differentially 
spliced in [14]. See Figure 2 for the profile plot of 
KLKW, where the thick lines representing the mean 
intensity in each group have been plotted for easing 
the interpretation. Note that there is one measure- 
ment per probe in each probeset, typically four probes 
per probeset. How to obtain such plots is described in 
the annotation and visualization section below. The 
genes in Figures 2, 3, 4 and 6 were chosen because 
they span over a shorter genomic region and show a 
clearer picture of the relationship between probesets 
and exons than the other genes in the lists of top 10 
genes. 

A slight variation of ANOSVA is the probeset 
model as implemented in Partek [40] (note that the 
probe subscript p has been removed): 

Yetr = |I + ae + (3^ + y,^ + /r + Qr + ^etr- (4) 

For the probeset-level ANOSVA, we used the 
probeset-level estimates obtained by affyPLM(..., 
mergeGroups = FALSE). After filtering for non-pre- 
sent or cross-hybridizing probesets, and absent tran- 
script clusters, we were left with 8189 transcript 
clusters to study. These clusters are sorted according 
to the minimum P-value of the individual t-test 
scores for differential splicing, and the top 10 genes 
obtained appear in Table 2. The genes MMPll, ZAK 
and COMP are in the top 10 genes for both the 



Table I: Top 10 differentially spliced genes 
with minimum P-values from the ANOSVA 
probe model, equation (3) 



Gene symbol 


min P-value 


MMPll 


2.98e- 10 


SOX9 


2.59e - 09 


FOXQI 


6.38e - 09 


KLKIO 


3.84e - 08 


SYNM 


4.53e - 07 


PHLDAI 


4.85e - 07 






SNTBI 


5.33e - 07 


COMP 


6.85e - 07 


XPOT 


7.87e - 07 



Genes highlighted in grey appear in Figures 2 and 5. 
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Figure 2: Profile plot of gene KLKIO with the gene model and possible transcripts retrieved from EnsembI [50]. The 
(— ) next to the gene name indicates that it is located on the reverse strand. The mean intensities of each group 
are plotted with a thicker line. Exons 5 (mapped by probesets 4 and 5) and 8 (mapped by probeset 9), counted 
from the 5^ end, seem to be higher expressed in tumour samples (blue/dashed) than in normal (red/solid). 




Figure 3: Profile plot of TGFBI with the gene model and transcripts retrieved from EnsembI [50]. The gene is on the 
forward strand as indicated by (+) next to its name. The mean intensities of each group are plotted with a thicker 
line; note that only one estimate is plotted by probeset, and it corresponds to the estimate computed by 
ExonRmaPlmC. . . , mergeGroups=FALSE) . Here, it seems like the tumour samples (blue/dashed) present increased 
expression from exon 3 until the end of the transcript, with respect to normal (red/solid) samples. Given that 
TGFBI is on the forward strand and that the difference is at the beginning of the transcript, we might be observing 
a case of alternative promoter usage. 
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Figure 4: Profile plot of gene LGALS4 with the gene model and transcripts retrieved from EnsembI [50], the gene is 
on the reverse strand. Interestingly, this gene only has one transcript. We might then be facing a novel splicing 
event or, more likely, a false positive detected by the method. 



Table 2: Top 10 differentially spliced genes 
with minimum P-values from the ANOSVA 
probeset model, equation (4) 



Gene symbol 


min P- value 


SOX4 


9.07e- 15 


MMPII 


9.48e- 14 


ZAK 


4.40e- 13 


FOXQI 


5.98e- 13 


TGFBI 


4.83e- 12 


UBAP2L 


8.l7e- 12 


COMP 


2.87e - 1 1 


SLC2AI 


4.l6e- II 


CDHII 


4.l9e- 10 


CPXMI 


4.47e- 10 



Genes highlighted in grey appear in Figures 3 and 5. 



ANOSVA probe and the ANOSVA probeset 
models. Gene TGFBI appears in Figure 3. 

FIRMA (Finding Isoforms using Robust 
Multichip Analysis) was first introduced by Purdom 
etal [30] for the exon array. In presence of differential 
splicing, the model in (1) will not fit and this will 
show up in the residuals. The linear model used is 
the following: 

Ypetr — H" ^tr ~^ ^petr (5) 

with dp the probe affinity, (3^^ the gene-level 
effect for chip tr and the error terms Sp^tr are i.i.d. 
N(0,a^)-distributed. Note that in contrast to the 
model in (3), we do not compute an overall gene- 
level estimate, but a gene-level estimate per chip. 



In model (5), like in model (1), there is no treat- 
ment/probeset interaction term, so differential spli- 
cing is analysed probesetwise using the residuals 
per probeset e: 

^petr — Ypetr i^p H" P^r)? 

j9 = 1, . . . r = 1, . . . , T, r = 1, . . . , R 

where OLp and are the estimates of and P^^. 

The median of the standardized residuals per 
probeset per chip is chosen as score statistic: 

F.(tA = median! — if^LA rj\ 

The standardization with the median absolute 
deviation of the residuals per gene makes the scores 
comparable across genes. 

The FIPJMA scores are extracted from the plmTr 
object obtained in the data pre-processing and sum- 
marization step. All probesets and transcript clusters 
were analysed by FIRMA, as it is part of the default 
aroma. affymetrix workflow. 

> firma <- FirmaModel (plmTr) 

> fit (firma, verbose = verbose) 

> fs< - getFirmaScores (firma) 

> firma . scores <- extractDataFrame ( fs ) 

After obtaining the FIPJMA scores per probeset per 
sample, we proceeded as described in [30]: (i) For 
each probeset, we took the difference of FIPJVLA 
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scores for each of the 10 pairs of normal/cancer sam- 
ples and (ii) calculated the mean of the 10 differences 
per probeset. Then (iii) we ranked the probesets 
according to their absolute mean difference: as the 
scores are comparable across transcript clusters, 
larger average differences between the normal and 
cancer samples will point at exons more differentially 
spliced between the two conditions. The resulting list 
was filtered to keep only probesets and transcript clus- 
ters that had passed the filter described in the filtering 
section above; the code is part of the .Snw file. To get 
a gene list instead of a probeset list as in [30], we 
mapped probesets to transcript clusters and then 
selected the top 10 genes on the list (Table 3). The 
profile plot of gene LGALS4 appears in Figure 4. The 
high average difference for this gene is owing to a 
FIRMA score of 830 030.8 at probeset 3861578 cor- 
responding to the tumour sample of replicate 7. 

Of 14 genes investigated for differential splicing in 
([14], Table 1), 10 passed our filtering procedure and 
were analysed for differential splicing: ACTNl, VCL, 
CALDl, SLC3A2, COL6A3, CTTN, FM, MAST2, 
ZAKmd FXYD6. The gene Z/IK appears in the top 
10 genes from ANOSVA probe and ANOSVA pro- 
beset, but a plot is not produced automatically by the 
code because ZAK is not recognized by BioMart. 
Instead, we looked the gene up in PubMed obtaining 
the Ensembl ID: ENSG00000091436 and used this to 
plot the gene (Figure 5). The gene COL6A3 appears 



among the top 100 genes for the ANOSVA probe and 
the ANOSVA probeset methods. Gene ACTNl is 
among the top 100 genes for ANOSVA probeset 
and it is also the first of Gardina's genes to appear in 
the filtered FIRMA list, at position 154. 

The differences in the gene lists obtained with 
ANOSVA and FIPJVIA are caused by the distinct 
nature of the two methods. ANOSVA was designed 
to look for splicing changes that are consistent within 
replicate sets, and differential splicing is assessed 
by the significance of a statistical test. FIRMA is a 
robust method that can detect splicing chances not 
necessarily consistent within replicate sets and so 



Table 3: Top 10 differentially spliced genes and their 
top scores from the FIRMA model, equations (5) to (7) 



Gene symbol 


Average difference 


LGALS4 


83 000.0 


HMGCS2 


82.9 


RPS6KAI 


31.8 


MUCI3 


28.2 


RPL35 


24.5 


RPL35A 


18.8 


SLC39AI4 


18.2 


TGFBI 


16.2 


COL23AI 


15.5 


SFT2D2 


14.2 



Gene LGALS4, highlighted in grey, appears in Figure 4. 



ZAK (+) 
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Figure 5: Profile plot of gene ZAK with the gene model and transcripts retrieved from Ensembl [50]. The gene is on 
the forward strand. There seems to be a differential splicing event identified by probesets 35 and 36 (counted from 
the 5^ end of the gene), corresponding to exons 19 and 20. 
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does not explain how to summarize from exon- 
sample scores to overall gene-level scores or whether 
it is recommended to do so. This implies some 
arbitrariness in the summarization procedure, which 
makes FIPJVLA less reproducible than ANOSVA. A 
thorough benchmarking of ANOSVA, FIPJVIA 
and several other methods for detecting differential 
splicing can be found in [41]. The comparison is 
done by means of receiver operating characteristic 
curves on two datasets: a panel of 11 human tissues 
with confirmed alternative splicing events; and a 
modification of a spike-in experiment where 25 
transcripts were hybridized to HeLa cells [42, 43]. 
In most of the experiments carried out, FIPJVIA 
seems to perform better than ANOSVA. In [44] 
and in [45], ANOSVA and FIPJVIA are compared 
with other methods, respectively, but not with each 
other. 

Differential gene expression 

In this section, we analyse differential gene expres- 
sion using probe-level data. We study two types of 
transcript clusters: (i) the ones not included in the 
ANOSVA probe analysis above (with only one 
probeset or not present in both normal and 
cancer groups) and (ii) the ones not showing differ- 
ential splicing (ANOSVA probe P-value above 
0.1). The analysis of group (ii) is based on the 
hierarchical principle: only look for significant 
main effects (differential expression in this case) 
among those transcript clusters with no significant 
interaction terms (differential splicing) [46 (p. 427)]. 
In total, we analysed 16 231 transcript clusters. We 
fit the following linear model to those transcript 
clusters: 

Ypetr = |i + ae + + /r + Qr + £petr (8) 

where is the gene-level treatment effect, is a para- 
meter that captures probesets expressed above or 
below the overall transcript cluster level and 1^ 
and Ctr are random effects for patient and chip, 
respectively. 

> aov<- 

+ aov ( intensity "-^ probeset + treatment + 

+ Error (replicate-\- 

+ replicate : treatment ) ) 

> p.F<- 

+ summary (aov) $ ''Error: Within 1 ] ] 

+ [' 'treatment' \ '' Pr(>F) 



Table 4: Top 10 differentially expressed genes from (8) 
with corrected P-values. Gene BEST4, highlighted in 
grey, appears in Figure 6 



G6ne symbol 


P-va.lu6 


LARGE 


0.00287 


IL6R 


0.00847 


RXRG 


0.00847 


BAI3 


0.00968 


Clorfl75 


0.01090 


GRIK3 


0.01090 


BEST4 


0.01090 


KCNN3 


0.01090 


PLEKHH2 


0.01090 


HOXDIO 


0.01090 



The null hypothesis is that the gene expression is 
the same in all groups. The top 10 genes, with 
adjusted P-values, appear in Table 4. The method 
used for adjusting the P-values was Benjamini- 
Hochberg's correction [47] using the function 
p. adjust (. . . , method = ' 'BR''). Only 80 of 159 
genes appearing in Gardina's list of genes up- and 
down- regulated in tumour ([14] additional file 1) 
passed our filters and were analysed for differential 
gene expression. Among those analysed, the genes 
CLDNl, SST, MUSK, KIAA1199 and SLC30A10 
were in the top 100. The gene BEST4 is shown in 
Figure 6. This gene is down-regulated in tumour 
(blue or dashed in the printed version) compared 
with normal (red or solid in the printed version) 
samples. The thicker lines represent the mean 
expression levels in the two groups. 

Gene annotation and visualization 

We chose to annotate transcript clusters to genes using 
the NetAffx transcript cluster annotation release 32 
specified above using the AnnotateGenesO function. 
Some transcript clusters present unspecific annotation 
and have several possible associated gene names. We 
have decided to remove such clusters from our 
output, as we cannot map them uniquely to a gene 
and afterwards interpret the result according to the 
gene structure. The number of transcript clusters pre- 
senting non-unique annotation to genes was 1 147 out 
of 8189 for the differential splicing analyses and 2917 
out of 16 231 in the differential expression analysis. 

After gene annotation, the user can select genes 
for visual inspection. Visual inspection of candidates 
for differential splicing is recommended by 
Affymetrix as a way to identify possible false positives 
[25] . Plots of gene profiles with integrated genomic 
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Figure 6: Profile plot of gene BEST4, which is on the reverse strand, with the gene model and transcripts retrieved 
from EnsembI [50]. The gene is under expressed in tumour (blue/dashed) compared with normal (red/solid) samples. 



information are obtained using the biomaRt [48] and 
GenomeGraphs [49] packages in Bioconductor. 

We use bioMart to connect to the latest version of 
the Homo sapiens dataset in EnsembI [50] (EnsembI 
genes 68, GRCh37.p8 at the time of writing) and 
retrieve genes by their HGNC symbol [51]. Gene 
and exon structures are imported from EnsembI. 
Gene and Exon objects are created by makeGeneO 
and makeTranscript 0 from GenomeGraphs. We 
store expression data and probeset start and stop posi- 
tions in an ExonArray object by makeExonArray () . 
The final plot is created passing a list with the 
objects created to the gdPlot (list (exon, gene, 
transcript,. . .)) function. 

Our plots show on top the gene HGNC symbol 
followed by (+) for genes on the forward strand and 
by (— ) for those on the reverse strand. Below, the plot 
of probeset intensities appears with vertical lines deli- 
miting probesets. Note that for models based on 
probe-level data (ANOSVA probe, FIPJVIA and dif- 
ferential expression), the intensities of all probes in the 
probeset (1—4) are shown. Samples from the same 
treatment group appear in the same colour, red/ 
solid for normal samples and blue/dashed for 
tumour samples in this case. For the genes detected 
by the ANOSVA probe, the ANOSVA probeset and 
the differential expression methods, thin lines show 
the expression level of each sample, whereas the 
thicker lines show the mean intensities in each of 
the groups. Immediately after the profile plot, the 



gene model retrieved from EnsembI is shown in 
orange, followed by the possible transcript model(s) 
in blue. The gene model consists of the exons that 
appear in all possible transcript models. Exons (boxes) 
in the gene model are linked by blue lines to the 
probesets above, indicating which probeset(s) inter- 
rogate which exon (Figures 2-6). 

DISCUSSION 

The aim of this article was to give a tutorial on how to 
perform a complete and reproducible analysis of exon 
array data in R/Bio conductor. We have worked with 
three packages: aroma. affymetrix, biomaRt and 
GenomeGraphs to go from .GEL files to intensity 
data, statistical analysis, annotation and visualization. 
The packages were chosen for their flexibility and ease 
of integration. We believe that our workflow covers a 
number of analysis variants for the exon array, includ- 
ing differential splicing analysis at probe and probeset- 
level and differential expression analysis at probe level, 
and gives the user the opportunity to focus on all or 
only some of the aspects of the data analysis. We make 
our entire code available so that other researchers can 
use it as it is or adapt it to their needs. 

Some possible modifications to the workflow 
include background correcting by subtracting from 
each probe the median intensity of all the exon array 
control probes with the same GC content, or 
removing noisy arrays identified in the quality 
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control step. The latter can be easily done by remov- 
ing such arrays from the sample information file and 
re-doing the background correction and normaliza- 
tion steps. A package in R for fitting linear and gen- 
eralized linear mixed-effects models is lme4 [52]. In 
our case, we used Im instead of Imer because it is 
faster, but it requires a balanced design. Finally, dif- 
ferential gene expression could be analysed using the 
gene-level estimates obtained from the plmTr object 
in other R packages such as limma [53]. Another 
extension of the workflow could include a general 
analysis strategy of the FIRMA scores, which in this 
study was tailor-made for a two-treatment scenario. 

Different Bioconductor packages could have been 
used in some of the analysis steps. For example, 
xmapcore [54] provides annotation data and cross- 
mappings between genetic features such as transcript 
clusters or exons and Affymetrix probesets. This 
package, however, requires the separate installation 
of a MySQL database, which makes this a more 
complex alternative than the one we have chosen. 
The xps package [55] could have been used for data 
preprocessing and summarization, but it requires the 
installation of the ROOT framework [56], and a certain 
level of understanding of ROOT files and ROOT 
trees is recommended. Our workflow does not 
require any prior knowledge beyond R/ 
Bioconductor. Other free software includes BP3- 
Array Tools [57], based on R, C, Fortran and Java, 
with an Excel front end, and dChip [58], which is 
written in Visual CH — h and developed for Windows, 
although some users have been able to run it on Mac 
and Unix computers. 

A previous article on exon arrays [24] suggests a 
pragmatic approach and does the analysis piecewise, 
starting with Affymetrix Power Tools (APT) and 
then exporting the data to R. We recognize this is 
a fix for the lack of straightforward packages for deal- 
ing with the exon array in Bioconductor. However, 
it implies working with several pieces of software so 
we do not find it fit for reproducible research. 
Licensed software, like Partek [40] or GeneSpring 
GX [59], has been used in other studies [22, 60]. 
In contrast to licensed software, R/Bioconductor is 
free and available for anyone, it allows the user to 
control most analysis options and it enables custo- 
mizable and reproducible analyses that are more 
easily reviewed. Still, the aroma. affymetrix package 
does not provide the speed of APT or the licensed 
software, and it requires more user input. 
Nevertheless, with this code and minimal user 



input, any dataset can be analysed regarding differ- 
ential expression at probe level and differential spli- 
cing using the ANOSVA model. 

Although the profile plots generated with 
GenomeGraphs are highly informative, they can be 
difficult to interpret for genes spanning over a long 
genomic region, for example, TGFBI in Figure 3 and 
ZAK in Figure 5. In our opinion, showing 3^ and 5^ 
ends, and exon and probeset numbers would signifi- 
cantly improve the readability of the plots. Actually, 
GenomeGraphs allows to add the Affymetrix probeset 
identity below the profile plots, but we believe that a 
probeset numbering relative to the gene over the 
profile plot would be preferable. In the future, it 
would also be interesting to study the flexibility of 
the output imported from Ensembl and, for example, 
remove the intronic regions from the gene and tran- 
script models in the graphical representations. 



SUPPLEMENTARY DATA 

Supplementary data are available online at http:// 
bib . oxfordj ournals . org/ . 
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Key points 

• The analysis of exon array data in R/Bioconductor is not yet well 
established. 

• Reproducible research is fundamental to guarantee that meth- 
ods can be compared on an equal footing, a framework for 
reproducible research is therefore needed. 

• With minor modifications, the code provided with this article 
can be used to analyse any dataset. 

• We give an overview of model-based methods for the analysis of 
differential splicing. 

• A publicly available dataset is analysed to exemplify the use of 
the code, including gene annotation and representation, and 
the methods for differential splicing. 
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